1 Setting up

First steps: set up the script with all the tools you need. This includes loading packages from your library of packages, using the library() command.

  • Note: The packages may first need to be installed using the install.packages() function, which defaults to looking at the CRAN repository, where curated packages are maintained.
  • You may also wish to update.packages() occasionally.
  • Some packages are available from the developer rather than through CRAN, and can be downloaded directly from GitHub using the install_github() function within the devtools package (which may be downloaded from CRAN).
### Intro to basic R using the GapMinder data set

library(tidyverse) 
### if not installed, highlight this and hit cmd-enter: install.packages(tidyverse)
### the tidyverse package is a super-package that loads the following
### packages, among others:
### * tidyr, with functions for data 'tidying'
### * dplyr, with functions for data manipulation and calculation
### * readr, with functions for reading/writing data in various formats
### * ggplot2, with functions for data visualization

library(stringr)
### This is part of the tidyverse package, but doesn't get loaded
### automatically with the 'library()' function.  It has functions
### for working with strings.

library(gapminder) # install.packages(gapminder)
### this package contains a stripped-down dataset from GapMinder.

### protip: put these in the 'setup' chunk and they will load automatically 
### when you first try to run anything in the script.

2 Data tidying

2.1 Loading data

For this exploration, we’ll use the data from the gapminder package loaded in the previous code chunk. There are other ways to get data into R, the most common being:

  • read.csv() from the base R package. Old school!
  • read_csv() from the readr package (considerably faster generally than read.csv() for large data sets). Note that when you load the tidyverse package, this gets loaded automatically.
  • there are also functions for reading Excel spreadsheets, .dbf files, spatial data files, and so on; these typically are accessed using a package (e.g. readxl) that you may need to install first.

2.2 Inspecting data

### this pulls the data from the gapminder package into an object within the 
### local environment to make it easier to access...
gap_df <- gapminder

There are some nice ways to inspect your data once you’ve loaded it into memory. Five that are particularly useful:

  • View() (or click on the variable name in the ‘Environment’ pane) for certain data types (classes) creates a tab here in the script window with an interactive view of your data.
  • summary() is pretty self-explanatory. The output looks different depending on the type of data you are looking at. It does not show you what the data looks like in its basic form, but gives you an overview of the entire set.
summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
  • head() which gives you just the first 6 (default) observations in the set. This shows you what the data actually looks like, in truncated form.
head(gapminder)
## # A tibble: 6 × 6
##       country continent  year lifeExp      pop gdpPercap
##        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia  1977  38.438 14880372  786.1134
  • glimpse() for data frames shows you each variable, its class, and the first few instances. This is built into the tidyverse package.
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
  • str() shows you the structure of the data. For complex data classes, this lets you see how it’s all put together.
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

2.3 Tidying data

2.3.1 Tidy data principles

‘Tidy data’ is a philosophy promoted among the data science community, and promoted by Hadley Wickham, a stats professor, R guru, and Chief Scientist at R Studio (he designed and developed many of the ‘new’ packages including the tidyverse family of packages).

https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Looking at the data views above, it looks like this data is already pretty “tidy”, i.e. each row is a single observation and each column is a single variable. But for fun let’s make it “untidy” then tidy it back up again. We’ll create a new object from the gapminder data with an untidy version of the data. Since it is a data frame, so I’ll add a _df tag at the end to help differentiate it.

2.3.2 tidyr package

Data tidying uses the tidyr package tools to reshape the data into a more tidy form (and into less tidy forms if you really insist), including:

  • gather() brings together data from several different variables (columns) into a single variable; a “key” column based on the original column names helps you distinguish them.
  • spread() is the opposite of gather() - take a single column of data, and a “key” column of data information, and spread the data out into columns with names taken from the “key” column.
  • separate() takes a single column with multiple values embedded in each observation (e.g. a name column with “Jane Doe”, “John Doe”, “Ebenezer Smythington”) and slice it into multiple columns (e.g. two columns, first name “Jane” etc. and last name “Doe” etc.)
  • unite() is the opposite of separate() - takes multiple columns and glues them together into one.

2.3.2.1 spread() to make our data wide format.

For our example, let’s make columns for each year, containing the population variable, for some reason. This will be “wide” format (more columns) rather than “long” format (more rows). Tidy data is typically “long” format and the tidyverse functions work better in that format.

pop_df_wide <- gap_df %>%
  select(country, continent, year, pop) %>%
  spread(key = year, value = pop)

The %>% is a ‘pipe’ operator which takes the output of one function and passes it into the next function as the first argument The tidyverse functions pretty much all take a dataframe (or tibble class as an updated version of data.frame class) as their first argument, so you can keep passing the dataframe along and modifying it in a more intuitive way, without needing temp variables along the way.

head(pop_df_wide)
## # A tibble: 6 × 14
##       country continent   `1952`   `1957`   `1962`   `1967`   `1972`
##        <fctr>    <fctr>    <int>    <int>    <int>    <int>    <int>
## 1 Afghanistan      Asia  8425333  9240934 10267083 11537966 13079460
## 2     Albania    Europe  1282697  1476505  1728137  1984060  2263554
## 3     Algeria    Africa  9279525 10270856 11000948 12760499 14760787
## 4      Angola    Africa  4232095  4561361  4826015  5247469  5894858
## 5   Argentina  Americas 17876956 19610538 21283783 22934225 24779799
## 6   Australia   Oceania  8691212  9712569 10794968 11872264 13177000
## # ... with 7 more variables: `1977` <int>, `1982` <int>, `1987` <int>,
## #   `1992` <int>, `1997` <int>, `2002` <int>, `2007` <int>

Ugh, that is ugly and hard to work with!

The same functionality could have been accomplished like this:

pop_df <- select(gap_df, country, continent, year, pop)
pop_df_wide <- spread(pop_df, key = year, value = pop)

or nested like this:

pop_df_wide <- spread(select(gap_df, country, continent, year, pop), key = year, value = pop)

But please don’t do that. Harder for other audiences (including your future self, perhaps the most important audience) to read and interpret.

2.3.2.2 gather() to make data “long” format

pop_df_narrow <- pop_df_wide %>%
  gather(key = year, value = pop, `1952`:`2007`) 
### it's rare, but if your column headings are numbers (or have spaces) you
### can put the backtick marks around them.

3 Data exploration and visualization

Now that we have some idea of the data format, let’s see if there are any questions we might be able to answer with it. Data visualization is hugely helpful for seeing relationships among variables, and the R package ggplot2 can create some nice plots quickly and reasonably intuitively.

Let’s start by looking at how per-capita GDP changes over time. How many countries and how many years are in the data set?

gap_df %>%
  select(country) %>%
  distinct() %>%
  nrow()
## [1] 142
gap_df$year %>%
  unique() %>%
  length()
## [1] 12

142 countries plotted across 12 separate points in time - that’s a lot of points. Let’s try it anyway.

3.1 Grammar of Graphics (“gg”)

The Grammar of Graphics philosophy is a way of thinking about plots as a collection of data variables, mapped to various aesthetics such as color, shape, size, etc. These variables can be represented by different geometric objects - points, lines, bars, etc. The mapping of variables to geometries and aesthetics can be controlled by different scales - e.g. color scales, axis scales, etc. There’s a cool feature called facets that allow you to quickly create small multiples of a plot, each with a variation on the underlying data set. And on top of these, there are themes to control the mechanical aspects of the overall plot - text size/color, grid lines, etc.

Note the ggplot grammar involves ‘+’ instead of ‘%>%’ – think of it as each line adding new data and aesthetic mappings, geometric objects, scales, and facet specification. It’s building the plot one element at a time, rather than piping the plot from one function to the next.

See Hadley Wickham’s take on it here: http://vita.had.co.nz/papers/layered-grammar.pdf

First let’s take the gap_df dataframe and map the year variable to the x axis and the gdpPercap to the y axis. We will assign the data to a point geometry. Lines and bars (and other cool geometries) require a few more aesthetics to properly control them; points are nice and easy.

ggplot(data = gap_df, aes(x = year, y = gdpPercap)) +
  geom_point()

OK, over time seems like per capita GDP increases; there are a few outliers; but it’s hard to really separate out the components here. What if we aesthetically map the continent variable to a color aesthetic? (We could do countries, but then the legend fills the entire screen to represent colors for 142 countries).

ggplot(data = gap_df, aes(x = year, y = gdpPercap)) +
  geom_point(aes(color = continent))

We can also create facets to show each continent separately:

ggplot(data = gap_df, aes(x = year, y = gdpPercap)) +
  geom_point(aes(color = continent)) +
  facet_wrap( ~ continent)

What is going on with Asia? we’ll come back to that.

3.2 Modifying and summarizing data

It’s often desirable to modify the data by calculating new variables (e.g. using per capita GDP and population to calculate a country’s total GDP) or by aggregating variables into groups (e.g. finding the mean or max per capita GDP for all countries within a continent).

3.2.1 dplyr package

The dplyr package provides tools to manipulate your data into more useful forms, including:

  • mutate() allows you to create new variables as combinations of other information
  • filter() allows you to select rows by whether they match certain criteria.
  • select() allows you to select, exclude, or rename columns based on column name (rename() as well)
  • distinct() eliminates duplicate rows; very helpful if you’ve eliminated some columns. Be very cautious with distinct() if you have used group_by()unique() also works like this, though I think it ignores groups.
  • group_by() allows you to create groups within your dataset that will be considered and calculated separately, though simultaneously, from other groups. ungroup() releases these groups back to the full dataset.
  • summarize() allows you to take grouped data and aggregate variables.
gap_df1 <- gap_df %>%
  group_by(continent) %>%
  mutate(gdp_total = gdpPercap * pop, ### separate multiple variable mutations using commas
         pcgdp_max = max(gdpPercap, na.rm = TRUE)) %>%
  ungroup()

# glimpse(gap_df1)
### changing the pop from 'integer' class to 'numeric' class avoids an issue 
### in the next step; 'integer' can only go up to about 2 billion, but
### other classes ('numeric', 'double') are more flexible.
### Sometimes the 'factor' class can be incredibly useful, but it can
### create weird issues at other times; let's change the continent and
### country variables to 'character' class which are more intuitive.
gap_df1 <- gap_df1 %>%
  mutate(pop       = as.numeric(pop),
         country   = as.character(country),
         continent = as.character(continent))

gap_continent_summary <- gap_df1 %>%
  group_by(continent, year) %>%
  summarize(pcgdp_mean  = mean(gdpPercap, na.rm = TRUE) %>% round(2), 
              ### pipe the mean into a round(value, number of digits) function to ditch absurd sig figs
            pop_mean    = round(mean(pop, na.rm = TRUE), 2),
              ### or just nest the mean into the round() function
            pop_total   = sum(pop, na.rm = TRUE),
            life_mean   = mean(lifeExp, na.rm = TRUE) %>% round(2),
            n_countries = n()) %>%
  ungroup()

### The DT library lets you use html/javascript interactive tables.
### You may need to first install the DT package.  
### The format here allows you to access an installed package without
### first loading the package using library(DT).  
### This format (package::function()) can also be very useful when,
### for example, two packages both have a same-named function and you want
### to make sure it uses the one from the correct package.
DT::datatable(gap_continent_summary)

3.2.1.1 Notes for the above code chunk

Note the mean, sum, median, etc functions will return NA if any of the data points in the group are NA. The na.rm = TRUE removes them before calculating. Note that summarize() collapses each grouping into a single row, and drops any columns that are not explicitly grouped or created. Using mutate() keeps all the rows and columns, and just add new variables as requested (and calculates means/maxes according to groups if any).

3.3 Assigning plots to objects

ggplot outputs can be stored as an object, which can then be further manipulated by adding more geometries/etc. That object can also be passed to other functions like print or ggplotly, or returned from within a function. If you assign the plot to an object, it suppresses the immediate displaying of the plot, but you can view the plot just by calling the plot object’s name.

pcgdp_plot <- ggplot(gap_continent_summary, aes(x = year, y = pcgdp_mean, color = continent, blargle = pop_total)) +
  geom_point() +
  labs(title = 'Per capita GDP by continent',
       x = 'Year',
       y = 'Mean per capita GDP (US$)')

# pcgdp_plot

### you can add additional layers to the plot, e.g. lines.  The new geometry
### will use the aesthetics as mapped in the original ggplot() call, unless
### you override them.  For a line, we need to tell it a 'group' so it knows
### which data points to connect.  For aesthetics that don't change based on
### the data, put those outside the aes() call.
pcgdp_plot_lines <- pcgdp_plot +
  geom_line(aes(group = continent), alpha = .3, size = 2)

# pcgdp_plot_lines

### or facets (small multiples, divided out by one of your variables)
pcgdp_facet_plot <- pcgdp_plot +
  facet_wrap( ~ continent, scales = 'fixed') # (try scales = 'free' too)

# print(pcgdp_facet_plot)

### The plotly package allows you to display interactive versions
### of ggplot objects using the ggplotly() function.
plotly::ggplotly(pcgdp_plot_lines)

As you mouse over the points in the ggplotly plot, notice that pop_total shows up… then notice in the original ggplot() call, I added a nonsense aesthetic and mapped pop_total to that. It doesn’t affect the look of the plot, but recognizes that I want that variable to show up in the pop-up window.

Now let’s see about identifying those outliers we saw earlier, with the very high per-capita GDP in Asia in the early part of the dataset. For each year, and each continent, let’s identify the poorest 25% and richest 25% of countries based on the average per-cap GDP across the entire time span. Then we can plot just the richest and inspect those outliers.

asia_df <- gap_df %>%
  group_by(country) %>%
  mutate(pcgdp_mean = mean(gdpPercap, na.rm = TRUE)) %>%
  group_by(continent) %>%
  mutate(cutoff_rich = quantile(pcgdp_mean, .75),
         cutoff_poor = quantile(pcgdp_mean, .25)) %>%
  ungroup() %>%
  mutate(wealth = ifelse(pcgdp_mean <= cutoff_poor, 'poor', 'middle'),
         wealth = ifelse(pcgdp_mean >= cutoff_rich, 'rich', wealth)) %>%
  filter(continent == 'Asia')

asia_gdp_plot <- ggplot(asia_df %>% filter(wealth == 'rich'),
                        aes(x = year, y = gdpPercap, color = country)) +
  theme_bw() + ### built-in theme to change the look of the plot
  geom_point(aes(size = pop)) +
  geom_line(aes(group = country), alpha = .3, size = 1)

plotly::ggplotly(asia_gdp_plot)

4 Linear models

Let’s try a different plot, examining whether per capita GDP and life expectancy are related.

What variables are likely to make a difference here? We can use facet_wrap() and aesthetics to compare how variables affect the relationship. We can use ggplotly() to allow for mouse-over to explore outliers.

gap_df_wealth <- gap_df %>%
  group_by(country) %>%
  mutate(pcgdp_mean = mean(gdpPercap, na.rm = TRUE)) %>%
  group_by(continent) %>%
  mutate(cutoff_rich = quantile(pcgdp_mean, .75),
         cutoff_poor = quantile(pcgdp_mean, .25)) %>%
  ungroup() %>%
  mutate(wealth = ifelse(pcgdp_mean <= cutoff_poor, 'poor', 'middle'),
         wealth = ifelse(pcgdp_mean >= cutoff_rich, 'rich', wealth))
         
pcgdp_vs_life_plot <- ggplot(gap_df_wealth %>%
                               filter(continent == 'Asia') %>%
                               filter(wealth %in% c('poor', 'rich')),
                             aes(x = gdpPercap, y = lifeExp, asdf = year)) +
  geom_point(aes(size = pop, color = country), alpha = .5) + ### use alpha to better see overlapping data pts
  scale_color_discrete(guide = 'none') +    ### hides color scale from the legend ('guide')
  geom_line(aes(color = country, group = country), size = .5, alpha = .5) +
  scale_y_continuous(limits = c(0, NA)) +   ### force 0 on y axis
  facet_grid( ~ wealth, scales = 'free_x')  ### does free_x distort perception of data?

pcgdp_vs_life_plot <- pcgdp_vs_life_plot +
  theme_bw() +### themes allow for changing the style of a plot.
  labs(title = 'Per-cap GDP vs life expectancy in Asia, 1952-2007')

print(pcgdp_vs_life_plot)

There are some clear patterns as we follow the countries through time; but let’s eliminate the time variable and just look at the most recent year of data.

pcgdp_life_2007_plot <- ggplot(gap_df_wealth %>%
                                 filter(year == max(year)),
                               aes(x = gdpPercap, 
                                   y = lifeExp,
                                   blargle = country,
                                   shape = continent)) +
  theme_bw() +
  geom_point(aes(size = pop, color = wealth), alpha = .7) + ### use alpha to better see overlapping data pts
  scale_y_continuous(limits = c(0, NA))  ### force 0 on y axis

print(pcgdp_life_2007_plot)

Seems like there is a definite pattern there too, but it is clearly not linear. Let’s see if we can linearize the data a bit, by taking a log transformation (base 10) of the gdpPercap variable. From this we can create a linear model of log(gdp) versus life expectancy. Note that this is looking across all years… so let’s map year to an aesthetic, in case a separate pattern emerges.

gap_df_log <- gap_df_wealth %>%
  mutate(log_gdp = log10(gdpPercap))

ggplot(gap_df_log, aes(x = log_gdp, y = lifeExp)) +
  geom_point(aes(color = year), alpha = .7) +
  scale_color_distiller(type = 'seq', palette = 'Spectral')

That seems to create a more linear pattern; note that there is a separate trend where it seems like more recent years are pushed upward relative to older years (which makes sense, right?). Let’s create a linear model based on this, and plot a trend line.

gap_log_model <- lm(formula = lifeExp ~ log_gdp, data = gap_df_log)
summary(gap_log_model)
## 
## Call:
## lm(formula = lifeExp ~ log_gdp, data = gap_df_log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.778  -4.204   1.212   4.658  19.285 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.1009     1.2277  -7.413 1.93e-13 ***
## log_gdp      19.3534     0.3425  56.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared:  0.6522, Adjusted R-squared:  0.652 
## F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16

Not a terrible model perhaps? would it be better or worse if we just looked at the most current year? Try it and let me know.

In the mean time let’s plot and add a trend line. Note the linear model output is not a data frame; it’s a different class of object: a list. (so is a ggplot object!). And as such we need to understand a little more about complex data structures in R. This chunk explores the structure (str()) and accesses the coefficients within the structure in a couple of different ways.

str(gap_log_model)
## List of 12
##  $ coefficients : Named num [1:2] -9.1 19.4
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "log_gdp"
##  $ residuals    : Named num [1:1704] -18.1 -17 -15.6 -13.4 -10.3 ...
##   ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:1704] -2455.08 430.51 -14.77 -12.57 -9.44 ...
##   ..- attr(*, "names")= chr [1:1704] "(Intercept)" "log_gdp" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:1704] 46.9 47.3 47.6 47.5 46.4 ...
##   ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:1704, 1:2] -41.2795 0.0242 0.0242 0.0242 0.0242 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:1704] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "log_gdp"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.02 1.03
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 1702
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = lifeExp ~ log_gdp, data = gap_df_log)
##  $ terms        :Classes 'terms', 'formula'  language lifeExp ~ log_gdp
##   .. ..- attr(*, "variables")= language list(lifeExp, log_gdp)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "lifeExp" "log_gdp"
##   .. .. .. ..$ : chr "log_gdp"
##   .. ..- attr(*, "term.labels")= chr "log_gdp"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(lifeExp, log_gdp)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "log_gdp"
##  $ model        :'data.frame':   1704 obs. of  2 variables:
##   ..$ lifeExp: num [1:1704] 28.8 30.3 32 34 36.1 ...
##   ..$ log_gdp: num [1:1704] 2.89 2.91 2.93 2.92 2.87 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language lifeExp ~ log_gdp
##   .. .. ..- attr(*, "variables")= language list(lifeExp, log_gdp)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "lifeExp" "log_gdp"
##   .. .. .. .. ..$ : chr "log_gdp"
##   .. .. ..- attr(*, "term.labels")= chr "log_gdp"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(lifeExp, log_gdp)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "log_gdp"
##  - attr(*, "class")= chr "lm"
gap_log_model$coefficients
## (Intercept)     log_gdp 
##   -9.100889   19.353423
gap_log_model[['coefficients']]
## (Intercept)     log_gdp 
##   -9.100889   19.353423
gap_log_model$coefficients[1] ### the intercept; [2] is the slope based on lifeExp as a function of log_gdp
## (Intercept) 
##   -9.100889

A slope of 19.3534233 means that for every one point increase in log_gdp (which corresponds to a 10x increase in per capita GDP) we get that many additional years of life expectancy!

mdl_int   <- gap_log_model$coefficients[1]
mdl_slope <- gap_log_model$coefficients[2]

label_x <- 3
label_y <- mdl_slope * label_x + mdl_int - 5

gap_df_log <- gap_df_log %>%
  mutate(wealth = factor(wealth, levels = c('poor', 'middle', 'rich')))
### note the use of factor here, instead of character, to force the order of
### the wealth categories in a categorical variable.

pcgdp_log_plot <- ggplot(gap_df_log, aes(x = log_gdp, y = lifeExp)) +
  geom_point(aes(color = wealth, size = pop), alpha = .5) +
  scale_color_manual(values = c('poor' = 'coral3', 'middle' = 'cornflowerblue', 'rich' = 'green3')) +
  geom_abline(intercept = gap_log_model$coefficients[1],
              slope     = gap_log_model$coefficients[2],
              color     = 'red') +
  annotate('text', x = label_x, y = label_y, 
           hjust = 0,
           label = sprintf('lifeExp = %.2f * log_gdp + %.2f + ε', mdl_slope, mdl_int))

print(pcgdp_log_plot)

5 Conclusion

I hope this helps with some basics of R, and provides some examples of code to reverse-engineer and use for your own projects.

This little tutorial was created using R Markdown; it is a handy way of communicating analyses in R, especially once you can use GitHub to publish your results directly to the web for others to use and admire.